public repository
Creating a Public Repository for Joining Private Data
How can one publish a dataset with sensitive attributes in a way that both preserves privacy and enables joins with other datasets on those same sensitive attributes? This problem arises in many contexts, e.g., a hospital and an airline may want to jointly determine whether people who take long-haul flights are more likely to catch respiratory infections. If they join their data by a common keyed user identifier such as email address, they can determine the answer, though it breaks privacy. This paper shows how the hospital can generate a private sketch and how the airline can privately join with the hospital's sketch by email address. The proposed solution satisfies pure differential privacy and gives approximate answers to linear queries and optimization problems over those joins. Whereas prior work such as secure function evaluation requires sender/receiver interaction, a distinguishing characteristic of the proposed approach is that it is non-interactive. Consequently, the sketch can be published to a repository for any organization to join with, facilitating data discovery. The accuracy of the method is demonstrated through both theoretical analysis and extensive empirical evidence.
- Health & Medicine (1.00)
- Transportation > Air (0.60)
- Information Technology > Security & Privacy (0.43)
Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Ranathunga, Surangika, de Silva, Nisansa
Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, the different facets of this problem and the reasons behind the disparity are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists among the languages of the world. We show that categorising languages solely by data availability may not always be correct. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, the amount of NLP/CL research, inclusion in multilingual web-based platforms, and inclusion in pre-trained multilingual models. We show that many languages are not covered by these resources or platforms, and that there is wide disparity even among languages belonging to the same language group. We analyse the impact of language family, geographical location, GDP and speaker population, and provide possible reasons for this disparity, along with some suggestions to overcome it.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > Sri Lanka (0.04)
- Information Technology > Communications > Web (1.00)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Information Management (0.94)
Exploring the Most Popular Machine Learning and Deep Learning GitHub Repositories
Machine learning and deep learning are currently subjects of broad interest in both academia and industry. Given their immense popularity, hundreds of thousands of GitHub repositories contain source code, documentation, and other useful information on a vast number of projects related to either topic. In this article, I explain how I collected, cleaned, and visualized data on a selection of the most popular machine learning and deep learning GitHub repositories. I also discuss the trends, patterns, and key findings related to each visualization I created. You can find all of the source code that supports this article in my own GitHub repository.
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe (0.04)
- Asia > China (0.04)
Proposed US AI Bill Costs May Outweigh Benefits
In early February, Senator Ron Wyden (D-Ore.), with Senator Cory Booker (D-N.J.) and Representative Yvette Clarke (D-N.Y.), introduced the Algorithmic Accountability Act of 2022. The bill aims to bring transparency and oversight to software, algorithms and other automated systems that are used to make automated decisions. "As algorithms and other automated decision systems take on increasingly prominent roles in our lives, we have a responsibility to ensure that they are adequately assessed for biases that may disadvantage minority or marginalized communities," said Sen. Booker. The bill requires companies to conduct impact assessments for bias, effectiveness and other factors when using automated decision systems to make critical decisions. It also gives the Federal Trade Commission (FTC) the authority to require companies to comply and to create a public repository of these automated systems.
- North America > United States (1.00)
- Europe (0.05)
- Law > Statutes (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)